Optical Character Recognition (OCR) is the process of extracting text from images or PDFs.

Built-in support

Since v0.103.0, Trilium has built-in support for OCR. The extracted text can be:

Supported formats

OCR in Trilium supports the following formats:

Images

PDFs

Currently, only text extraction is supported for PDFs, not OCR.

Office documents

The text will be extracted from the following file formats:

Configuring and triggering OCR

OCR can be configured by going to Options → Media and looking for the Text Extraction (OCR) section.

There are three ways to trigger the OCR:

Minimum confidence

When extracting text from an image, there is a certain level of confidence which indicates whether the extracted text appears relevant.

When the minimum confidence is set to a low percentage, the text extraction can interpret symbols and drawings incorrectly, resulting in garbled text.

If the confidence of the text extracted for a note or an attachment is lower than the minimum confidence, the OCR result is disregarded.
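The threshold rule described in this section can be illustrated with a minimal sketch. This is not Trilium's actual implementation: the `OcrResult` type, the `accept_ocr_result` function, and the 0 to 100 percentage scale are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OcrResult:
    """Hypothetical container for one OCR pass over a note or attachment."""
    text: str
    confidence: float  # assumed to be a percentage between 0 and 100


def accept_ocr_result(result: OcrResult, min_confidence: float) -> Optional[str]:
    """Return the extracted text, or None if it falls below the threshold.

    A result whose confidence is lower than the minimum is disregarded;
    a result exactly at the minimum is kept.
    """
    if result.confidence < min_confidence:
        return None
    return result.text


# A clean scan passes a 75% threshold; a doodle read as text does not.
clean = accept_ocr_result(OcrResult("Invoice 2024", 92.0), 75.0)
noisy = accept_ocr_result(OcrResult("l|/_~", 40.0), 75.0)
```

With a very low threshold, the second result would be accepted as well, which is how garbled text can end up in the extracted content.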

Language management

OCR needs to be aware of the language of the content in order to work correctly. Each language has its own data, which needs to be downloaded, and accents or other language-specific symbols are not supported by the default language.

To configure the languages used by OCR, go to Options → Language & Region and adjust the Content languages.

When there are no content languages defined, the user interface Language is used instead.

After making this change, both automatic processing and manual reprocessing will take the new languages into account.

To force detection in a particular language for a given note, use the language attribute, as with the content language of text notes. For attachments, it is not possible to manually adjust the language.
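The fallback order described in this section can be summarized with a short sketch. It is only an illustration under stated assumptions, not Trilium's code: the function name `resolve_ocr_languages` and its parameters are hypothetical, and it assumes that a per-note language replaces the configured content languages rather than being added to them.

```python
from typing import List, Optional


def resolve_ocr_languages(
    content_languages: List[str],
    ui_language: str,
    note_language: Optional[str] = None,
) -> List[str]:
    """Pick the languages for an OCR pass (hypothetical helper).

    Assumed precedence:
    1. the language attribute set on the note, if any;
    2. the configured content languages, if any;
    3. the user interface language as a last resort.
    """
    if note_language:
        return [note_language]
    if content_languages:
        return list(content_languages)
    return [ui_language]


# Attachments cannot carry their own language, so note_language stays None.
for_attachment = resolve_ocr_languages(["en", "ro"], "en")
for_note = resolve_ocr_languages(["en", "ro"], "en", note_language="de")
unconfigured = resolve_ocr_languages([], "fr")
```

Here `for_attachment` uses the content languages, `for_note` uses only the note's own language, and `unconfigured` falls back to the user interface language.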

Viewing extracted content for a single note

To access the extracted content of a note:

This section allows: